Big Data Computing by Vivek Kale

Big Data Computing by Vivek Kale

Author:Vivek Kale [Kale, Vivek]
Language: eng
Format: epub
Published: 2016-10-31T13:20:28+00:00


234

Big Data Computing

of a reliable distributed computing approach that would scale to the demand of the vast

amount of website data that the tool would be collecting. A year later, Google published

papers on the GFS and MapReduce, an algorithm and distributed programming plat-

form for processing large data sets; they utilized large numbers of commodity servers

and built GFS and MapReduce in a way that assumed hardware failures would be com-

monplace and were simply something that the software needed to deal with.

Hadoop was modeled after two papers produced by Google, one of the many

companies to have these kinds of data-intensive processing problems. The

first, presented in 2003, describes a pragmatic, scalable, distributed file sys-

tem optimized for storing enormous data sets called the Google Filesystem,

or GFS. In addition to simple storage, GFS was built to support large-scale,

data-intensive, distributed processing applications. The following year, another

paper, titled “Map-Reduce: Simplified Data Processing on Large Clusters,” was pre-

sented, defining a programming model and accompanying framework that provided

automatic parallelization, fault tolerance, and the scale to process hundreds of tera-

bytes of data in a single job over thousands of machines. When paired, these two

systems could be used to build large data processing clusters on relatively inexpen-

sive commodity machines. These papers directly inspired the development of

Hadoop Distributed File System and Hadoop MapReduce, respectively.

10.3.1 Apache Hadoop

In 2006, after struggling with the same “big data” challenges related to indexing massive amounts of information for its search engine, and after watching the progress of the Nutch project, Yahoo! hired Doug Cutting and decided to adopt Hadoop as its distributed framework for solving its search engine challenges. Yahoo! spun out the storage and processing parts of Nutch to form Hadoop as an open source Apache project, and the Nutch web

crawler remained its own separate project. Shortly thereafter, Yahoo! began rolling out

Hadoop as a means to power analytics for various production applications. The platform

was so effective that Yahoo! merged its search and advertising into one unit to better leverage Hadoop technology.

In the past 10 years, Hadoop has evolved from its search engine-related origins to one

of the most popular general-purpose computing platforms for solving big data challenges.

It is rapidly becoming the foundation for the next generation of data-based applications.

It is predicted that Hadoop will be driving a big data market that should hit more than $23

billion by 2016. Since the launch of the first Hadoop-centered company, Cloudera, in 2008, dozens of Hadoop-based start-ups have attracted hundreds of millions of dollars in venture capital investment. Simply put, organizations have found that Hadoop offers a proven approach to big data analytics.

Apache Hadoop has revolutionized data management and processing. Hadoop’s techni-

cal capabilities have made it possible for organizations across a range of industries to solve problems that were previously impractical. These capabilities include the following:

1. Scalable processing of massive amounts of data on commodity hardware

2.

Flexibility for data processing, regardless of the format and structure (or lack of

structure) of the data



Download



Copyright Disclaimer:
This site does not store any files on its server. We only index and link to content provided by other sites. Please contact the content providers to delete copyright contents if any and email us, we'll remove relevant links or contents immediately.